Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

Figure 7.20(a) shows the Lasso discrimination model constructed for discriminating between the USA sequences and the India sequences. In this discrimination model, the five words with the best discrimination power for separating the genomic patterns of these two countries were TAC, AGC, ATG, CAT and GTC. Figure 7.20(b) shows the Lasso discrimination model constructed for discriminating between the USA sequences and the Russia sequences. In this model, the top five words were ATA, AGC, GAA, AAC and TAT.

Figure 7.20. (a) The Lasso discrimination model constructed for discriminating the USA sequences against the India sequences. (b) The Lasso discrimination model constructed for discriminating the USA sequences against the Russia sequences.
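To make the idea of ranking words by discrimination power concrete, the following is a minimal sketch rather than the code used to produce Figure 7.20. It assumes the glmnet package is installed, and it uses synthetic sequences drawn from two different base compositions in place of the real USA, India and Russia sequences. The helper names `count_3mers` and `sim_seq` are hypothetical and introduced only for illustration.

```r
# Assumption: glmnet is installed; it is not named in the text above.
library(glmnet)

set.seed(1)
bases <- c("A", "C", "G", "T")

# The 64 possible 3-mer words.
words <- apply(expand.grid(bases, bases, bases, stringsAsFactors = FALSE),
               1, paste0, collapse = "")

# Hypothetical helper: count overlapping 3-mers in one sequence string.
count_3mers <- function(s) {
  starts <- seq_len(nchar(s) - 2)
  kmers <- substring(s, starts, starts + 2)
  as.numeric(tabulate(match(kmers, words), nbins = length(words)))
}

# Hypothetical helper: simulate a sequence with a given base composition.
sim_seq <- function(n, prob) {
  paste(sample(bases, n, replace = TRUE, prob = prob), collapse = "")
}

# Synthetic stand-ins for two groups of sequences (NOT real data).
group_a <- replicate(30, sim_seq(300, c(0.30, 0.20, 0.20, 0.30)))
group_b <- replicate(30, sim_seq(300, c(0.20, 0.30, 0.30, 0.20)))
seqs <- c(group_a, group_b)
y <- rep(c(0, 1), each = 30)

# Rows are sequences, columns are 3-mer word frequencies.
X <- t(vapply(seqs, count_3mers, numeric(length(words)), USE.NAMES = FALSE))
colnames(X) <- words
X <- X / rowSums(X)

# Lasso-penalised logistic regression (alpha = 1 selects the Lasso penalty).
fit <- cv.glmnet(X, y, family = "binomial", alpha = 1, nfolds = 5)

# Coefficients at the cross-validated lambda, intercept dropped.
beta <- as.matrix(coef(fit, s = "lambda.min"))[-1, 1]

# Rank words by absolute coefficient and keep the five largest.
top5 <- head(sort(abs(beta), decreasing = TRUE), 5)
print(names(top5))
```

Ranking by absolute coefficient is one simple way to read discrimination power off a Lasso model. Because the features are frequencies on a common scale and glmnet standardises predictors by default, the coefficients are roughly comparable across words, but this ranking criterion is an assumption of the sketch.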

Figure 7.21 shows a hierarchical cluster generated using the kmer package based on the 3-mer word library for five randomly selected USA sequences and five randomly selected India sequences. It can be seen that the USA and India sequences generally formed two distinct clusters.

Figure 7.21. The hierarchical cluster generated by the kmer package for five randomly selected sequences from USA and five sequences from India.
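The clustering step can be sketched in base R as follows. This is an illustration under stated assumptions, not the procedure behind Figure 7.21. It does not call the kmer package. It swaps in Euclidean distance on 3-mer frequencies with average-linkage `hclust`. The ten sequences are synthetic, and the helper names `count_3mers` and `sim_seq` are hypothetical.

```r
set.seed(42)
bases <- c("A", "C", "G", "T")

# The 64 possible 3-mer words.
words <- apply(expand.grid(bases, bases, bases, stringsAsFactors = FALSE),
               1, paste0, collapse = "")

# Hypothetical helper: count overlapping 3-mers in one sequence string.
count_3mers <- function(s) {
  starts <- seq_len(nchar(s) - 2)
  kmers <- substring(s, starts, starts + 2)
  as.numeric(tabulate(match(kmers, words), nbins = length(words)))
}

# Hypothetical helper: simulate a sequence with a given base composition.
sim_seq <- function(n, prob) {
  paste(sample(bases, n, replace = TRUE, prob = prob), collapse = "")
}

# Five synthetic sequences per group (NOT the USA and India data).
seqs <- c(replicate(5, sim_seq(500, c(0.35, 0.15, 0.15, 0.35))),
          replicate(5, sim_seq(500, c(0.15, 0.35, 0.35, 0.15))))
names(seqs) <- c(paste0("groupA_", 1:5), paste0("groupB_", 1:5))

# Rows are sequences, columns are 3-mer word frequencies.
freq <- t(vapply(seqs, count_3mers, numeric(length(words))))
colnames(freq) <- words
freq <- freq / rowSums(freq)

# Pairwise Euclidean distances, then agglomerative clustering.
d <- dist(freq)
hc <- hclust(d, method = "average")

# Cut the tree into two clusters and compare with the known groups.
groups <- cutree(hc, k = 2)
print(table(cluster = groups, label = sub("_.*", "", names(seqs))))
```

Calling `plot(hc)` draws the dendrogram. The distance measure and linkage method are choices made for this sketch, and other choices may give a different tree.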